Dyr og Data

Introduction to statistics

Gavin Simpson

Aarhus University

Mona Larsen

Aarhus University

2024-10-22

Introduction

In science we collect data to test hypotheses

We use statistics

Inference

Statistical inference relates to using probabilities to make decisions about parameters of interest in a population from statistics (estimates) derived from a subset or sample from that population

Are the population means of two groups, control & treated, different?

Key concepts

  • test statistics & their distribution
  • null hypothesis testing
  • statistical significance
  • confidence intervals

Samples and populations

Samples and populations

Interested in the efficacy of a new drug for treating disease in animals

The population of interest is all the animals with the disease across herds in a sector

We could recruit the farmers managing all the herds in the sector into a study, and give each animal either the new drug or a control

Measure the recovery time of the two groups

Problems are:

  • would be far too costly,
  • would be far too time consuming, and
  • we’d never recruit every herd into the study

Samples and populations

An alternative approach is to sample from the population of interest

Take a random, representative subset of animals (or herds) from the population of interest

How representative the sample is depends on how it was collected

Sampling

Many statistical methods assume that the data represent a random sample from a population

Difficult to do in practice

  • need to identify every individual in the target population
  • some farmers or owners may refuse to be part of the study

Alternatively we could use cluster sampling: select n representative herds and then randomly sample individuals from within those herds

Sampling error

Statistics computed on samples will rarely be exactly equal to the population values

  • A sample of 20 male and 20 female penguins
  • Mean bill length in male Adelie = 40.54
  • Mean bill length in female Adelie = 37.36

Is there evidence of sexual dimorphism in Adelie penguin bill lengths?

Samples and populations

A distinction is often made between parameters and statistics

  • Parameters relate to the population: population mean
  • Statistics refer to the sample: sample mean

A sample statistic is an estimate of a population parameter

We use the terms sample statistic and parameter estimate interchangeably

Probability

Probabilities describe the chance that an event will occur

Probabilities range from 0 to 1, with smaller values indicating the occurrence is unlikely and larger values indicating the occurrence is likely

Compute probabilities as \(\displaystyle \frac{m}{M}\), where \(m\) is the number of outcomes of interest and \(M\) is the number of all possible outcomes

Probability of a head on a single coin toss: \(\frac{1}{2} = 0.5\)

Probability of rolling an odd number on a 10-sided die: \(\frac{5}{10} = 0.5\)

Probability of rolling a value less than 3 on a 10-sided die: \(\frac{2}{10} = 0.2\)
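The \(\frac{m}{M}\) calculation, plus a simulation check, might look like this in R (the die-rolling example is from the slide; the simulation is an added sketch):

```r
# probability of rolling an odd number on a 10-sided die, as m / M
m <- 5   # outcomes of interest: 1, 3, 5, 7, 9
M <- 10  # all possible outcomes
m / M    # 0.5

# check by simulation: roll the die many times with sample()
set.seed(1)
rolls <- sample(1:10, size = 100000, replace = TRUE)
mean(rolls %% 2 == 1)  # close to 0.5
```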

Probability

What is the probability of randomly selecting two samples with means as different as observed if the samples came from the same population?

Need to consider the sampling error:

  • samples drawn from populations with the same mean: any difference is due to sampling error
  • samples drawn from populations with different means: the difference is due to a real population difference

Probability

Earlier we saw a difference of means of the bill lengths of male and female Adelie penguins of 3.18

  • Assume that there is no difference between male and female Adelie penguin’s bill lengths
  • Under this assumption we can say \(\mu_{\text{male}} = \mu_{\text{female}}\)
  • Using a computer we can see how likely the observed difference is:
    1. draw at random two samples of 20 values each from a population with mean = 40.54, the male mean bill length, (& same variance)
    2. record the difference in means of these two samples
  • repeat 100 times
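Those steps can be sketched in base R; the mean 40.54 comes from the slide, while sd = 1.71 (the male sample SD quoted later) is an assumed value:

```r
# draw two samples of 20 from the *same* population and record the
# difference of their means; repeat 100 times
set.seed(42)
diffs <- replicate(100, {
  s1 <- rnorm(20, mean = 40.54, sd = 1.71)  # sd assumed from the male sample
  s2 <- rnorm(20, mean = 40.54, sd = 1.71)
  mean(s1) - mean(s2)
})
hist(diffs)  # the sampling distribution of the difference under H0
```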

Probability

Probability

How likely is a difference of means of >3? How likely is a difference of means of >0.5?

Read off from the cumulative distribution function of the sampling distribution of the statistic

Null hypothesis

Often, the null hypothesis, \(\text{H}_0\), is that there is no effect

\(\text{H}_0\): there is no effect of sex on the bill length of Adelie penguins

This means that, under \(\text{H}_0\) we can rearrange the sex variable randomly, pairing any value of bill_length_mm with a male or female penguin

Under \(\text{H}_0\), these shufflings of the data should look similar to our observed data

Null hypothesis

Under \(\text{H}_0\), these shufflings of the data should look similar to our observed data

Do the shuffled data look like the observed data?
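One such shuffle can be sketched in base R with sample(); the bill-length values below are made up for illustration:

```r
set.seed(2)
sex  <- rep(c("male", "female"), each = 20)
bill <- c(rnorm(20, 40.5, 1.7), rnorm(20, 37.4, 1.7))  # illustrative values

shuffled_sex <- sample(sex)       # permute the labels; bill values stay put
tapply(bill, shuffled_sex, mean)  # group means after one shuffle
```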

Permutation test

What we just did is the basis of a permutation test of \(\text{H}_0\)

A real test would shuffle — permute — the data many more times than 4

Permutation tests with infer 📦

adelie_shuffled <- withr::with_seed(2,
  adelie_bills |>
    specify(bill_length_mm ~ sex) |>
    hypothesize(null = "independence") |>
    infer::generate(reps = 1000, type = "permute")
  )
Response: bill_length_mm (numeric)
Explanatory: sex (factor)
Null Hypothesis: independence
# A tibble: 40,000 × 3
# Groups:   replicate [1,000]
   bill_length_mm sex    replicate
            <dbl> <fct>      <int>
 1           41.1 female         1
 2           39.6 female         1
 3           36.9 female         1
 4           40.1 female         1
 5           42.7 female         1
 6           38.1 female         1
 7           36.7 female         1
 8           40.6 female         1
 9           39.2 female         1
10           36.6 female         1
# ℹ 39,990 more rows

Permutation tests with infer 📦

With the infer 📦  we

  1. use specify() to say which variables are involved in the test and which is the response and which are the explanatory variables, and
  2. then use hypothesize() to indicate what are we testing — here we are testing the independence of bill_length_mm and sex, and then
  3. use generate() to create 1,000 permutations of the data, which are
  4. stored in adelie_shuffled

We also wrapped all that in withr::with_seed(), which is another way of setting random seeds for the RNG

Permutation tests with infer 📦

Now we have 1,000 randomly shuffled — permuted — data sets

Our null hypothesis (\(\text{H}_0\)) is that there is

no difference in the mean bill length of male and female Adelie penguins

For each permuted data set we need to calculate the

  1. mean bill_length_mm for the males — \(\hat{\mu}_{\text{Males}}\)
  2. mean bill_length_mm for the females — \(\hat{\mu}_{\text{Females}}\)
  3. the difference of these two means: \(\hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}}\)

This difference of means will be our test statistic

Permutation tests with infer 📦

We could do those calculations using dplyr but infer makes it even easier via calculate()

adelie_null_distribution <- adelie_shuffled |>
  calculate(stat = "diff in means", order = c("male", "female"))
  • "diff in means" is how we indicate we want a difference of means
  • order indicates the order of the means in the difference

The order doesn’t matter, so long as we are consistent

We did \(\hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}}\) so we will put male first

This generates the null distribution of the test statistic

Null distribution

The null distribution shows the distribution of the test statistic under \(\text{H}_0\) of no effect

Observed value

We also need to compute the observed value of the test statistic (i.e. the one for our data)

Do this using the same pipeline, without the hypothesize() and generate() steps

obs_diff_means <- adelie_bills |>
  specify(bill_length_mm ~ sex) |>
  calculate(stat = "diff in means", order = c("male", "female"))

obs_diff_means
Response: bill_length_mm (numeric)
Explanatory: sex (factor)
# A tibble: 1 × 1
   stat
  <dbl>
1  3.18

Null distribution

If we add the observed value to the plot we can see how typical our observed difference of means is relative to the values we would expect if there were no effect of sex

visualise(adelie_null_distribution, bins = 20) +
  shade_p_value(obs_stat = obs_diff_means, direction = "both")

One tail or two?

Whether a test is one- or two-tailed is a common distinction in research studies

Relates to the alternative hypothesis:

  • do we hypothesize only that the means are different?, or
  • do we hypothesize that the mean of the males is greater (less) than the females?

This is where the concept of one- or two-tailed tests comes from: in which tail do we find the rejection region?

Null distribution

If we add the observed value to the plot we can see how typical our observed difference of means is relative to the values we would expect if there were no effect of sex

We asked for direction = "both" so the alternative hypothesis is that the males and females have different mean bill lengths

Alternative hypothesis

We typically denote the alternative hypothesis as \(\text{H}_1\) or \(\text{H}_{\text{a}}\)

This hypothesis is usually what we are interested in knowing about, but we can’t test it explicitly

\(\text{H}_{\text{a}}\) could be:

  1. that male and female Adelie penguins have different bill lengths on average, or
  2. that male Adelie penguins have longer bill lengths on average than females, or
  3. that male Adelie penguins have shorter bill lengths on average than females

Alternative hypothesis

  1. that male and female Adelie penguins have different bill lengths on average, or
  2. that male Adelie penguins have longer bill lengths on average than females, or
  3. that male Adelie penguins have shorter bill lengths on average than females

Option 1 is a two-sided test so we set direction to be "two-sided" or "both"

Option 2 is a one-sided test so we set direction to be "greater" or "right"

Option 3 is also a one-sided test but we set direction to be "less" or "left"

(for order = c("male", "female")!)

Alternative hypothesis

Before observing the penguins, if we had hypothesized that males had longer bills than females, we could have done a one-sided test of the hypothesis

male Adelie penguins have longer bill lengths on average than females

\[\begin{align*} \text{H}_{\text{0}} & : \mu_{\text{Males}} = \mu_{\text{Females}} \\ \text{H}_{\text{a}} & : \mu_{\text{Males}} > \mu_{\text{Females}} \end{align*}\]

Alternative hypothesis

It might be easier to express our hypotheses in terms of the difference of means

Two-tailed

\[\begin{align*} \text{H}_{\text{0}} & : \mu_{\text{Males}} - \mu_{\text{Females}} = 0 \\ \text{vs H}_{\text{a}} & : \mu_{\text{Males}} - \mu_{\text{Females}} \neq 0 \end{align*}\]

One-tailed

"less"

\[\begin{align*} \text{H}_{\text{0}} & : \mu_{\text{Males}} - \mu_{\text{Females}} = 0 \\ \text{vs H}_{\text{a}} & : \mu_{\text{Males}} - \mu_{\text{Females}} < 0 \end{align*}\]

"greater"

\[\begin{align*} \text{H}_{\text{0}} & : \mu_{\text{Males}} - \mu_{\text{Females}} = 0 \\ \text{vs H}_{\text{a}} & : \mu_{\text{Males}} - \mu_{\text{Females}} > 0 \end{align*}\]

Alternative hypothesis

visualise(adelie_null_distribution, bins = 20) +
  shade_p_value(obs_stat = obs_diff_means, direction = "greater")

Don’t change your mind

You must decide what kind of \(\text{H}_{\text{a}}\) you want to test before you observe (or analyse) the data

You can’t use a two-sided alternative \[ \text{H}_{\text{a}} : \mu_{\text{Males}} - \mu_{\text{Females}} \neq 0 \]

and reject \(\text{H}_{\text{0}}\), noting that males have longer bills than females and then decide to use the more powerful alternative

\[ \text{H}_{\text{a}} : \mu_{\text{Males}} - \mu_{\text{Females}} > 0 \]

The p-value of the test

Using the null distribution we can obtain a p-value

the probability that a test statistic as or more extreme than the observed statistic would occur if the null hypothesis were true

We count how many difference of means in the null distribution are

  • equal to or greater than (in absolute value)
  • greater than, or
  • less than

the observed difference of means \(\hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}}\)

Why “in absolute value” for the two-sided test?
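The counting can be done directly in base R; null_stats below is a stand-in for the 1,000 permuted differences and obs for the observed difference (values assumed for illustration):

```r
set.seed(5)
null_stats <- rnorm(1000, mean = 0, sd = 0.6)  # stand-in null distribution
obs <- 3.18                                    # observed difference of means

mean(abs(null_stats) >= abs(obs))  # two-sided: compare in absolute value
mean(null_stats >= obs)            # one-sided, "greater"
```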

The p-value of the test

In our case, the p-value is reported to be 0

# p value for the two-sided alternative: the bill lengths are different
adelie_null_distribution |>
  get_p_value(obs_stat = obs_diff_means, direction = "two-sided")
# A tibble: 1 × 1
  p_value
    <dbl>
1       0
# p value for the one-sided alternative: male bill lengths are larger
adelie_null_distribution |>
  get_p_value(obs_stat = obs_diff_means, direction = "greater")
# A tibble: 1 × 1
  p_value
    <dbl>
1       0

This doesn’t make sense, so instead we say that \(p\) is less than \(1/P\)

\[\text{p value} = p < \frac{1}{P} = \frac{1}{1000} = 0.001\]

\(P\) is the number of permuted data sets we generated (1,000 here)

Null Hypothesis Significance Testing

NHST

This is the basis of Null Hypothesis Significance Testing or NHST

The null hypothesis is the situation we test; no difference in means between groups, no relationship between x and y

Look for evidence against this null hypothesis; unlikely values of statistics if the null hypothesis were true

We permuted the data to generate the null distribution of the test statistic

the sampling distribution of the test statistic under \(\text{H}_0\)

Much theoretical work in statistics is in deriving sampling distributions of different statistics, or types of statistics, using maths instead of resampling (permuting)

Standard normal distribution

The standard normal distribution is a Gaussian distribution with \(\mathsf{\mu = 0}\) & \(\mathsf{\sigma^2 = 1}\)

People often call values from this distribution z-scores

Calculate z-scores by subtracting the mean score from each score and dividing by \(\sigma\)

\[z_i = \mathsf{\frac{\text{score}_i - \text{mean}}{\text{standard deviation}}}\]

\[z_i = \mathsf{\frac{14 - \text{14.95}}{\text{5.47}}} = \mathsf{-0.17}\]

Negative z-scores lie below the sample mean, positive scores above it

A z-score of -22.52 tells us that the score of 2mm is 22.52 standard deviations below the mean
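In R, z-scores can be computed by hand or with scale(); the first line reproduces the slide's calculation, and the remaining values are illustrative:

```r
(14 - 14.95) / 5.47  # the slide's example: about -0.17

# by hand vs scale() on some illustrative scores
x <- c(14, 9, 21, 12, 18)
z <- (x - mean(x)) / sd(x)
all.equal(z, as.numeric(scale(x)))  # TRUE
```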

Standard normal distribution

68% probability a z-score lies between -1 and 1; 95% probability a z-score lies between -1.96 and 1.96
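These probabilities can be checked with the standard normal CDF, pnorm():

```r
pnorm(1) - pnorm(-1)        # ~0.68
pnorm(1.96) - pnorm(-1.96)  # ~0.95
```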

The p-value

Probability distributions, like the standard normal or t distribution, help us compute the probability of obtaining a statistic if the null hypothesis were true

This is the probability due to chance

Under the standard normal, the probability of observing a score of 2.5 or larger is 0.006

Under a t distribution with 5 df, the probability of observing a score of 2.5 or larger is 0.027

The t distribution is often used for small samples & where the sample variance is estimated (not known), which for many samples will be an underestimate of the true variance
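The tail probabilities quoted above can be reproduced with pnorm() and pt():

```r
pnorm(2.5, lower.tail = FALSE)       # standard normal: ~0.006
pt(2.5, df = 5, lower.tail = FALSE)  # t with 5 df: ~0.027
```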

The p-value tells us nothing about the probability of the null hypothesis or the alternative hypothesis

t and normal distributions

Redrawn from Krzywinski & Altman (2013) Significance, P values and t-tests. Nature Methods 10(11) 1041–1042 doi: 10.1038/nmeth.2698

One- and two-tailed tests

Whether a test is one- or two-tailed is a common distinction in research studies

Relates to the hypothesis:

  • do we hypothesize only that the means are different?, or
  • do we hypothesize that the mean of the treated group is greater (less) than the control?

This is where concept of one- or two-tailed tests comes from; in which tail do we find the rejection region

Theoretical test

The classical test for a difference of means is known as a two-sample t test
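The call that produced the output below is not shown on the slide; with the same adelie_bills data it would be t.test() with var.equal = TRUE. A self-contained sketch with simulated data:

```r
# two-sample t test; var.equal = TRUE gives the classical "Two Sample t-test"
set.seed(1)
d <- data.frame(
  bill_length_mm = c(rnorm(20, 37.4, 1.9), rnorm(20, 40.5, 1.9)),
  sex = factor(rep(c("female", "male"), each = 20))
)
t.test(bill_length_mm ~ sex, data = d, var.equal = TRUE)
```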


    Two Sample t-test

data:  bill_length_mm by sex
t = -5.4197, df = 38, p-value = 3.557e-06
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -4.374686 -1.995314
sample estimates:
mean in group female   mean in group male 
              37.355               40.540 

Theoretical test

The two-sample t test is a special case of a linear model

m <- lm(bill_length_mm ~ sex,
  data = adelie_bills)
summary(m)

Call:
lm(formula = bill_length_mm ~ sex, data = adelie_bills)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2400 -1.0262  0.0025  0.8350  4.8450 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.3550     0.4155   89.89  < 2e-16 ***
sexmale       3.1850     0.5877    5.42 3.56e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.858 on 38 degrees of freedom
Multiple R-squared:  0.436, Adjusted R-squared:  0.4211 
F-statistic: 29.37 on 1 and 38 DF,  p-value: 3.557e-06

Theoretical test

The two-sample t test is a special case of a linear model

library("marginaleffects")

avg_comparisons(m)

 Estimate Std. Error    z Pr(>|z|)    S 2.5 % 97.5 %
     3.19      0.588 5.42   <0.001 24.0  2.03   4.34

Term: sex
Type:  response 
Comparison: mean(male) - mean(female)
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 

We will focus on linear models in this course

Type I and Type II errors

In NHST we obtain the probability (\(p\)) of observing as extreme a statistic as the observed if the null hypothesis were true: evidence against the null

If \(p\) is small we reject the null hypothesis & accept the alternative: there is a chance we falsely reject the null

If \(p\) is large we fail to reject the null hypothesis & thus reject the alternative: there is a chance we falsely fail to reject the null

No effect in population
  • Reject null hypothesis: Type I error
  • Fail to reject null hypothesis: correct decision
Effect in population
  • Reject null hypothesis: correct decision
  • Fail to reject null hypothesis: Type II error

Statistical Significance

  • What is a small p-value?
  • A typical value is p <= 0.05, which means 5% Type I error rate, or 5 times in 100 you’d expect to see as extreme a statistic as the observed purely due to chance if \(\text{H}_0\) is true
  • In other words; you’d falsely reject the null hypothesis, on average, 5 times in 100
  • Might be OK for some studies or research
  • Probably not OK for making life & death decisions or declaring you’ve found a new elementary particle
  • Particle physicists use 5 sigma: need a p <= ~0.000000287 to reject the null!

Randall Munroe http://xkcd.com/1478/

Problems with NHST: Effect Size

A major problem with people’s use of NHST is the over-emphasis put on the p-value

The p-value is just a measure of conditional probability; just because you get a low p-value doesn’t mean the effect is important

Increase the sample size and, all else being equal, you can detect smaller (statistically significant) effects

It is important to ask whether the effect size is important or relevant

Effect size refers to the

  • magnitude of the difference between groups
  • magnitude of the relationship between \(x\) and \(y\)

Problems with NHST: power

Statistical power is the ability of a study to detect a real effect

Power is a function of the Type II error:

\[\mathsf{power} = 1 - \beta\]

Power ranges from 0 (no chance of detecting the real effect) to 1 (100% chance of detecting the real effect)

There is no point in conducting a study, especially one involving animals, that has a low chance of finding the real effect

Problems with NHST: power

Power is a function of several important factors

  1. Sample size — the larger the sample size the greater the power to detect an effect of a given size
  2. Criterion for significance (\(\mathsf{\alpha = 0.05}\) say) — the lower \(\mathsf{\alpha}\), the less likely you are to reject the null hypothesis
  3. One- or two-tailed tests — one-tailed tests are typically more powerful than two-tailed ones; making a stronger statement about the expected effect
  4. Effect size — the bigger the effect the study tries to detect the greater the power

For simple models (t-tests, one-way ANOVA, etc) there are relatively simple equations for computing the power of a planned study for given sample size, effect size etc.
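For example, R's power.t.test() does this calculation for a two-sample t test; the effect size and SD below are taken from the penguin example (difference of means 3.18, residual SD ≈ 1.86):

```r
# power of a two-sample t test with n = 20 per group
power.t.test(n = 20, delta = 3.18, sd = 1.86, sig.level = 0.05)
```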

For more complex models, power calculations become difficult; simulation using a computer is a solution

Confidence Intervals

Confidence Intervals

In NHST, too much emphasis is placed on just the sample statistic (mean, difference of means, regression slope \(\beta_j\), etc)

A different way of presenting statistical results is the confidence interval

Confidence intervals can be interpreted as (from Wikipedia)

  • Were this procedure to be repeated on multiple samples, the calculated confidence interval (which would differ for each sample) would encompass the true population parameter 95% of the time
  • There is a 95% probability that the calculated confidence interval from some future experiment encompasses the true value of the population parameter
  • The confidence interval represents values for the population parameter for which the difference between the parameter and the observed estimate is not statistically significant at the 5% level

Confidence intervals

As with the resampling (permutation) and theoretical approaches we used for hypothesis testing we can compute a resampling or theoretical version of a confidence interval

The resampling version uses a non-parametric bootstrap procedure

Bootstrap

The standard error of the mean (SEM)

the estimated standard deviation of the means of all possible samples of the variable

Rather than use this theoretical result we could resample the data to generate the sampling distribution

Instead of permuting the data (resampling without replacement) we’ll use resampling with replacement

With replacement means that each time we draw a single data point to add to our sample, we put it back so it can be drawn again

Bootstrap

A single bootstrap sample then is the same size as our data

But it will include repeats of some of the observed values

This also means some values in the data won’t be in a bootstrap sample
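In base R, a single bootstrap resample is just sample() with replace = TRUE (illustrative values):

```r
set.seed(3)
x <- c(39.1, 40.6, 37.8, 36.5, 41.1)  # illustrative bill lengths
boot <- sample(x, size = length(x), replace = TRUE)
table(boot)  # some values repeat; others are missing from this resample
```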

Bootstrap

A single bootstrap sample can be generated using rep_sample_n()

Set replace = TRUE to get a bootstrap sample instead of a permutation

adelie_bills |>
  rep_sample_n(
    size = 40,
    replace = TRUE
  ) |>
  select(replicate, bill_length_mm, sex)
# A tibble: 40 × 3
# Groups:   replicate [1]
   replicate bill_length_mm sex   
       <int>          <dbl> <fct> 
 1         1           36   female
 2         1           39.1 male  
 3         1           41.3 male  
 4         1           38.2 male  
 5         1           40.6 male  
 6         1           40.6 male  
 7         1           40.8 male  
 8         1           38.8 male  
 9         1           33.5 female
10         1           37.9 female
# ℹ 30 more rows

Bootstrap

To generate multiple samples, set reps

adelie_bills |>
  rep_sample_n(
    size = 40,
    replace = TRUE,
    reps = 21
  ) |>
  select(replicate, bill_length_mm, sex)
# A tibble: 840 × 3
# Groups:   replicate [21]
   replicate bill_length_mm sex   
       <int>          <dbl> <fct> 
 1         1           39.6 female
 2         1           37.3 female
 3         1           42.2 female
 4         1           35.9 female
 5         1           41.6 male  
 6         1           39.7 female
 7         1           43.2 male  
 8         1           41.3 male  
 9         1           36.3 male  
10         1           43.2 male  
# ℹ 830 more rows

Bootstrap

Bootstrap CI

Can use the bootstrap to generate a CI for the mean bill length of the male Adelie penguins

male_boot <- adelie_bills |>
  filter(sex == "male") |>
  specify(response = bill_length_mm) |>
  generate(
    reps = 1000,
    type = "bootstrap") |>
  calculate("mean")

male_boot |>
  visualize()

Bootstrap CI

The percentile CI uses percentiles of the distribution. E.g. for 95% CI

  • lower endpoint is 0.025 quantile (2.5%)
  • upper endpoint is 0.975 quantile (97.5%)
male_ci <- male_boot |>
  get_ci(type = "percentile", level = 0.95)

male_boot |>
  visualize() +
  shade_ci(endpoints = male_ci)

Interval: 39.8 – 41.24mm
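Under the hood the percentile CI is just quantile() applied to the bootstrap distribution; a base-R sketch with simulated stand-in data:

```r
set.seed(4)
x <- rnorm(20, mean = 40.5, sd = 1.7)  # stand-in for the male bill lengths
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
quantile(boot_means, probs = c(0.025, 0.975))  # percentile 95% CI
```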

Bootstrap

Difference of means of 21 bootstrap samples of the Adelie penguin bill lengths

adelie_bills |>
  specify(bill_length_mm ~ sex) |>
  generate(
    reps = 21,
    type = "bootstrap") |>
  calculate(
    stat = "diff in means",
    order = c("male", "female")) |>
  visualize()

21 samples is too small to do much with

Bootstrap

Difference of means of 1000 bootstrap samples of the Adelie penguin bill lengths

boot_distn <- adelie_bills |>
  specify(bill_length_mm ~ sex) |>
  generate(
    reps = 1000,
    type = "bootstrap") |>
  calculate(
    stat = "diff in means",
    order = c("male", "female"))

visualize(boot_distn)

Percentile CI

pc_ci <- boot_distn |>
  get_ci(type = "percentile", level = 0.95)

pc_ci
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1     1.98     4.33

Does the confidence interval include 0?

Percentile CI

boot_distn |>
  visualize() +
  shade_ci(endpoints = pc_ci)

Theoretical CIs

Theoretical CI for the mean bill length for the male penguins

  • Mean of the males — 40.54; SD of the males — 1.71

  • Standard error of the mean: \(\frac{\hat{\sigma}}{\sqrt{n}} = \frac{1.71}{\sqrt{20}} = \frac{1.71}{4.47} = 0.38\)

  • Critical value of t for 95% test — 2.093

    • qt(0.025, df = 20-1, lower.tail = FALSE)
  • 95% Confidence interval is

    \[\mathsf{CI} = \overline{y} \pm (2.093 \times \mathsf{SEM})\]

  • Lower CI limit — 39.74

  • Upper CI limit — 41.34
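The same numbers drop out of a few lines of R:

```r
# reproducing the theoretical 95% CI from the slide
n <- 20; xbar <- 40.54; s <- 1.71
sem <- s / sqrt(n)                                  # ~0.38
tcrit <- qt(0.025, df = n - 1, lower.tail = FALSE)  # ~2.093
xbar + c(-1, 1) * tcrit * sem                       # ~39.74 to ~41.34
```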

Theoretical CIs

Theoretical CIs

The theoretical 95% CI for the difference of means for the male and female Adelie penguins is: 1.99 – 4.38

The bootstrap 95% CI is: 1.98 – 4.33

Resampling inference

Resampling-based inference is a general approach to inference in statistics

Theoretical approaches developed in era of no computers

Resampling methods exploit fast computation

Fewer assumptions, but some assumptions remain

Theoretical approaches still have their place

How good are diagnostic tests?

Diagnostic testing — say for SARS-CoV-2 — is typically thought of as

  1. someone either has the disease or not,
  2. they take a test which comes out either positive or negative,
  3. that test tells us the truth

What does “true positive” mean? Carrying the virus, infectious, …?

What counts as a “positive” result depends on the type of test (PCR or rapid lateral flow)

All tests can give a “wrong” result

True & False positives & negatives

A false negative is a negative test result when patient has the disease

A false positive is a positive test result when the patient doesn’t have the disease

Two-way table

         Disease   No disease
+ test   TP        FP
- test   FN        TN

Accuracy of test results

We describe the accuracy of test results through two quantities

  • TPR — True-positive rate

the proportion of people with the virus who get a positive test result

We also call this sensitivity

  • TNR — True-negative rate

the proportion of people not infected with the virus who get a negative test result

We also call this specificity

Accuracy of test results

Confusingly, however, we typically use the complements of these values

False negative rate: 1 - sensitivity

False positive rate: 1 - specificity

False positive & negative rates

At the end of June 2020 in the UK less than 0.05% of PCR test results were positive

Even if no virus were circulating, the estimated false positive rate could be no larger than 0.05%

For every 20,000 people without the virus, we would expect just one to test positive

It was also estimated that between 5% and 15% of positive test results were false (FP)

Similar values could be reported for lateral flow devices

Test conclusions

A positive test result doesn’t mean you are infected

What’s more important for people is the probability that you are infected if you test positive

More generally, what does a particular test result mean for an individual?

These are predictions, conditional upon a given test result

Test conclusions

Suppose a lateral flow test has a TPR of 60% and a TNR of 99.9% (a FPR of 1 in 1,000)

Suppose further that the virus is at levels in early March 2021, when ~0.3% of the UK population had an infection

What happens if we test 1,000,000 people?

3,000 people have the disease (0.3% of 1,000,000) of whom 1,800 test positive (60% of 3,000)

997,000 people don’t have the disease but 997 (0.1%) get a false positive test result

The FPR is 1 in 1,000, yet 36% (997 / (1,800 + 997)) of positive test results are false
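The arithmetic above is Bayes' theorem applied to counts:

```r
n    <- 1e6    # people tested
prev <- 0.003  # prevalence: 0.3% have the disease
tpr  <- 0.60   # sensitivity
fpr  <- 0.001  # false positive rate, 1 in 1,000

tp <- n * prev * tpr        # 1,800 true positives
fp <- n * (1 - prev) * fpr  #   997 false positives
fp / (tp + fp)              # ~0.36 of positive results are false
```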

Test conclusions

flowchart LR
  poptested["1,000,000"]
  disease["3,000"]
  nodisease["997,000"]
  pop((Population)) --> |tested | poptested
  poptested --> | disease | disease
  poptested --> | no disease | nodisease
  disease --> test1{Test}
  test1 --> | positive | tp["1,800"]
  test1 --> | negative | fn["1,200"]
  nodisease --> test2{Test}
  test2 --> | positive | fp["997"]
  test2 --> | negative | tn["996,003"]
  tp --> tps[TP]
  fn --> fns[FN]
  fp --> fps[FP]
  tn --> tns[TN]

Test conclusions

The FPR is 1 in 1,000, yet 36% (997 / (1,800 + 997)) of positive test results are false

This is confusing to many people

as a disease gets rarer, more positive test results will be false

This is an application of Bayes’ Theorem

Shows how important the underlying prevalence of disease is in interpreting test results

Conditional probability

Mammography is ~90% accurate for breast cancer screening

Assuming 1% of screened women have breast cancer

flowchart LR
  popscreened["1,000"]
  cancer["10"]
  nocancer["990"]
  pop((Population)) --> | screened | popscreened
  popscreened --> | cancer | cancer
  popscreened --> | no cancer | nocancer
  cancer --> test1{Test}
  test1 --> | positive | tp[9]
  test1 --> | negative | fn[1]
  nocancer --> test2{Test}
  test2 --> | positive | fp[99]
  test2 --> | negative | tn[891]
  tp --> tps[TP]
  fn --> fns[FN]
  fp --> fps[FP]
  tn --> tns[TN]

Shows the harm screening can cause if underlying risk is low

Evidence based medicine

Evidence-based veterinary medicine is the formal strategy for integrating the best available research evidence with clinical expertise and the unique needs or wishes of each client in clinical practice. Much of this is based on results from research studies that have been carefully designed and statistically evaluated.

5 steps

  1. Translation of uncertainty to an answerable question
  2. Systematic retrieval of the best evidence available
  3. Critical appraisal of evidence for internal validity:
    • Systematic errors as a result of selection bias, information bias and confounding
    • Quantitative aspects of diagnosis and treatment
    • The effect size and aspects regarding its precision
    • Clinical importance of results
    • External validity or generalizability
  4. Application of results in practice
  5. Evaluation of performance

5 steps

  1. Ask — This step is about identifying the right questions the veterinary surgeons need answers to
  2. Acquire — This step is about obtaining evidence on the subject of interest. This involves systematically searching the existing literature and, where there is no evidence, undertaking new studies to answer the question of interest
  3. Appraise — This step involves appraising the literature for quality and sources of bias that may affect the believability of the results
  4. Apply — This step involves applying the evidence to practice, where appropriate
  5. Audit — This step is all about assessing whether the application of the new evidence has affected the outcome of interest

Levels of evidence

Current gold standard is the randomised controlled trial (RCT)

  • Level I: evidence obtained from at least 1 properly designed RCT
  • Level II-1: evidence obtained from well-designed controlled trials without randomization
  • Level II-2: evidence obtained from well-designed cohort studies or case-control studies, preferably from more than one center or research group
  • Level II-3: evidence obtained from multiple time series designs with or without the intervention. Dramatic results in uncontrolled trials might also be regarded as this type of evidence
  • Level III: opinions of respected authorities, based on clinical experience, descriptive studies, or reports of expert committees

Criticisms

  • Disconnect between the evidence from RCTs at level of groups of people vs what’s best for an individual patient
  • May not be possible to do RCTs, meta-analysis, expert reviews for all diseases
  • Evidence tends to come from studies of single diseases, while in many situations patients present with multiple comorbidities
  • Dealing with biases in what is researched and published
  • Lags between research being done and published, and published and put into practice